The mathematical procedures we use in least squares ensure that:
\(1\). The mean of the residuals is always zero: \(\overline{e} = 0\). Because we included an intercept (\(a\)), the regression line goes through the point of averages, and the mean of the residuals is always 0. The same is true of residuals from the mean (deviations \(y_i - \overline{y}\)).
We choose \(\begin{pmatrix}a \\ b \end{pmatrix}\) such that \(e\) is orthogonal to \(\mathbf{X}\). One column of \(\mathbf{X}\) is all \(1\)s, to get the intercept (recall how we used vectors to get the mean). So \(e\) is orthogonal to \(\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\).
\[\mathbf{1}'e = 0\]
And if this is true, then \(\sum e_i = 0\), so \(\overline{e} = \frac{1}{n}\sum e_i = 0\).
\(2\). \(Cov(X,e) = 0\). This is true by construction of least squares: we chose \(\beta\) (i.e., \(a, b\)) such that \(X'e = 0\), making them orthogonal. \(X'e = 0 \to \sum x_ie_i = 0 \to \overline{xe}=0\); \(\overline{e}=0\) from above; so \(Cov(X,e) = \overline{xe}-\overline{x}\,\overline{e} = 0 - \overline{x}\cdot 0 = 0\).
This also means that residuals \(e\) are always perfectly uncorrelated (in the mathematical sense) with all the columns in our matrix \(\mathbf{X}\): all the variables we include in the regression model.
\[\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}; \mathbf{Y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}; \beta = \begin{pmatrix} a \\ b \end{pmatrix}\]
\[(X_{p \times n}'X_{n \times p})^{-1}X_{p \times n}'Y_{n \times 1} = \beta_{p \times 1}\]
\[b = \frac{Cov(x,y)}{Var(x)} = r \frac{SD_y}{SD_x}\]
\[a = \overline{y} - \overline{x}\cdot b\]
\[\widehat{Y_i} = a + b \cdot x_i\]
\[\widehat{Y}_{n \times 1} = \mathbf{X}_{n \times p}\beta_{p \times 1}\]
\[\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \widehat{Y} = \begin{pmatrix} \hat{y_1} \\ \vdots \\ \hat{y_n} \end{pmatrix}\]
In order for the calculation of least squares to be possible, two things must be true: \(\mathbf{X}'\mathbf{X}\) must be invertible, which requires that the columns of \(\mathbf{X}\) be linearly independent (and hence that there are at least as many observations as parameters).
Let’s look at some data:
| fheight | sheight |
|---|---|
| 65.04851 | 59.77827 |
| 63.25094 | 63.21404 |
| 64.95532 | 63.34242 |
| 65.75250 | 62.79238 |
| 61.13723 | 64.28113 |
| 63.02254 | 64.24221 |
```r
# "X" matrix
X = cbind(1, father.son$fheight)
head(X)
##      [,1]     [,2]
## [1,]    1 65.04851
## [2,]    1 63.25094
## [3,]    1 64.95532
## [4,]    1 65.75250
## [5,]    1 61.13723
## [6,]    1 63.02254
# "Y" matrix
Y = father.son$sheight
head(Y)
## [1] 59.77827 63.21404 63.34242 62.79238 64.28113 64.24221
```
```r
beta = solve(t(X) %*% X) %*% t(X) %*% Y
```

What should the dimension of beta be?

```r
beta
##           [,1]
## [1,] 33.886604
## [2,]  0.514093
```
Typically we use the `lm` function. It assumes an intercept by default:

```r
# lm(y ~ x1 + x2 + ... + xn, data = data.frame.with.our.variables)
ls = lm(sheight ~ fheight, data = father.son)
```

Did we get the same results?

```r
summary(ls)$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
## fheight      0.514093 0.02704874 19.00618 1.121268e-69
beta
##           [,1]
## [1,] 33.886604
## [2,]  0.514093
```
Do residuals have the expected properties?

```r
y_hat = beta[1,] + beta[2,] * father.son$fheight
# Get residuals
e = father.son$sheight - y_hat
# Are residuals mean 0?
mean(e)
## [1] -5.945586e-12
# Are residuals orthogonal to X?
t(e) %*% X
##               [,1]          [,2]
## [1,] -6.409351e-09 -4.322255e-07
# Do the residuals have 0 covariance with X?
cov(e, X)
##      [,1]        [,2]
## [1,]    0 1.49169e-12
```
We observe what we expect mathematically.

Let’s consider this data:

Let’s use least squares:
```r
# "X" matrix
X = cbind(1, x1)
# "Y" matrix
Y = y1
# Get beta
beta = solve(t(X) %*% X) %*% t(X) %*% Y
```
Do residuals have the expected properties?

```r
# Get y hat
y_hat = X %*% beta
# Get residuals
e = Y - y_hat
# Are residuals mean 0?
mean(e)
## [1] -2.29981e-17
# Is covariance of residuals and X == 0?
cov(e, X)
##              x1
## [1,] 0 2.197894e-15
```
Even though residuals are mechanically uncorrelated with \(X\), that tells us nothing about how to interpret the coefficients substantively.
```r
lm(sheight ~ fheight, data = father.son) %>% summary %>% .$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
## fheight      0.514093 0.02704874 19.00618 1.121268e-69
```
Is the intercept substantively meaningful?
How do you interpret the slope?
Let’s center father’s height on its mean…
```r
father.son$fheight_centered = father.son$fheight - mean(father.son$fheight)
lm(sheight ~ fheight_centered, data = father.son) %>% summary %>% .$coefficients
##                   Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      68.684070 0.07421078 925.52689 0.000000e+00
## fheight_centered  0.514093 0.02704874  19.00618 1.121268e-69
```
Is the intercept substantively meaningful?
How do you interpret the slope?
R-squared is supposed to capture “goodness-of-fit”/“explained variance”.
But this is a concept that is not about causality, just about how well variance in \(Y\) can be “predicted” by \(X\).
- total sum of squares (TSS): variance of the outcome, \(\frac{1}{n}\sum (Y_i - \bar{Y})^2\)
- residual sum of squares (RSS): variance of the residuals, \(\frac{1}{n}\sum (Y_i - \hat{Y_i})^2\)
- model sum of squares (MSS): variance of the model estimates, \(\frac{1}{n}\sum (\hat{Y_i} - \bar{Y})^2\)
- R-squared: fraction of variance in \(Y\) “explained” by the model, \(R^2 = 1 - \frac{RSS}{TSS} = \frac{MSS}{TSS}\)
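These identities are easy to verify. A minimal sketch with simulated data (not the data used elsewhere in this document), checking that \(TSS = MSS + RSS\) and that \(MSS/TSS\) matches the \(R^2\) that `lm` reports:

```r
set.seed(1)
x = rnorm(100)
y = 2 + 0.5 * x + rnorm(100)
fit = lm(y ~ x)
y_hat = fitted(fit)
# The 1/n factors cancel in the ratios, so plain sums of squares suffice
TSS = sum((y - mean(y))^2)
RSS = sum((y - y_hat)^2)
MSS = sum((y_hat - mean(y))^2)
all.equal(TSS, MSS + RSS)                     # TRUE
all.equal(MSS / TSS, summary(fit)$r.squared)  # TRUE
```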
Previously we fit the mean of \(Y\) as a linear function of \(X\):
\[Y_i = a + b \cdot X_i\]
Now, we can imagine fitting mean of \(Y\) as a linear function of many variables:
\[Y_i = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]
Thinking about “projecting onto \(p\) dimensions” can be helpful:
When we project into two dimensions, these dimensions are like the \(x\) and \(y\) axes on a graph: perpendicular to each other. This is not an analogy; it is exactly what we do with multivariate regression, because we project \(Y\) onto the orthogonal components of \(X\).
Examples: Linear Dependence
\[\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}\]
Examples: Linear Dependence
\[\begin{pmatrix} 1 & 2 & 0 & 0 \\ 1 & 0 & 2 & 0 \\ 1 & 0 & 0 & 2 \end{pmatrix}\]
Examples: Linear Dependence
\[\begin{pmatrix} 1 & 0.50 & 0.25 & 0.25 \\ 1 & 0.25 & 0.50 & 0.25 \\ 1 & 0.25 & 0.25 & 0.50 \end{pmatrix}\]
When we include more than one variable in the equation, we cannot calculate slopes using simple algebraic expressions like \(\frac{Cov(X,Y)}{Var(X)}\).
We calculate least squares using the same matrix equation (\((X'X)^{-1}X'Y\)) as in bivariate regression, but what is the math doing in the multivariate case?
When fitting the equation:

\[Y = b_0 + b_1 x + b_2 z\]

the slopes are \(b_1 = \frac{Cov(x^*, Y)}{Var(x^*)}\) and \(b_2 = \frac{Cov(z^*, Y)}{Var(z^*)}\),

where \(x^* = x - \hat{x}\) from the regression \(x = c_0 + c_1 z\), and \(z^* = z - \hat{z}\) from the regression \(z = d_0 + d_1 x\).
Does anything look familiar here?
More generally:
\[Y = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k\]
\(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\)
where \(X_k^* = X_k - \hat{X_k}\) obtained from the regression:
\(X_{k} = c_0 + c_1 X_{1} + \ldots + c_{k-1} X_{k-1}\)
\(X_k^*\) is the residual from regressing \(X_k\) on all other \(X_{j \neq k}\)
How do we make sense of \(X_k^*\), the residual of \(X_k\) after regressing it on all the other \(X_{j \neq k}\)?

How do we make sense of \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\), given that \(X_k^*\) is that residual?
The slope \(b_k\) is the change in \(Y\) with a one-unit change in the part of \(X_k\) that is uncorrelated with/orthogonal to the other variables in the regression.
It is better to think of it as “partialling out” the relationship with the other variables in \(\mathbf{X}\).
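This “partialling out” logic can be checked directly. A minimal sketch with simulated data (the variables here are illustrative, not from the examples above): the multivariate slope on \(x\) equals \(\frac{Cov(x^*, Y)}{Var(x^*)}\), where \(x^*\) is the residual from regressing \(x\) on \(z\):

```r
set.seed(2)
z = rnorm(200)
x = 0.6 * z + rnorm(200)          # x is correlated with z
y = 1 + 2 * x - z + rnorm(200)
b_x = coef(lm(y ~ x + z))["x"]    # multivariate slope on x
x_star = resid(lm(x ~ z))         # part of x orthogonal to z
b_fw = cov(x_star, y) / var(x_star)
all.equal(unname(b_x), b_fw)      # TRUE
```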
There are additional implications of defining the slope \(b_k = \frac{Cov(X_k^*, Y)}{Var(X_k^*)}\):

Now we can see why the columns of \(X\) must be linearly independent: if \(X_k\) were a linear combination of the other columns, its residual \(X_k^*\) would be exactly zero, so \(Var(X_k^*) = 0\) and \(b_k\) would be undefined.
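A sketch of what goes wrong numerically: if one column of \(\mathbf{X}\) is a linear combination of the others (here, an intercept column plus a full set of dummies, as in the linear dependence examples above), then \(\mathbf{X}'\mathbf{X}\) is singular and \((\mathbf{X}'\mathbf{X})^{-1}\) does not exist:

```r
d1 = c(1, 0, 0, 1, 0, 0)
d2 = c(0, 1, 0, 0, 1, 0)
d3 = c(0, 0, 1, 0, 0, 1)
X = cbind(1, d1, d2, d3)    # column 1 equals d1 + d2 + d3
qr(X)$rank                  # 3, not 4: X is rank deficient
try(solve(t(X) %*% X))      # errors: the system is exactly singular
```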
Let’s say we want to estimate the mean earnings for professionals as a function of hours worked, profession, gender, and age.
Using data from the American Community Survey, we can look at how hours worked are related to earnings for doctors and lawyers. Here we use UHRSWORK (usual hours worked per week), SEX (an indicator for female), AGE (in years), LAW (an indicator for being a lawyer), and INCEARN (total annual income earned in dollars). We restrict our sample to full-time workers (over 35 hours per week and 40 weeks per year).
Let’s now estimate the relationship between hours worked and earnings, conditioning on age, profession, and gender:
```r
ls = lm(INCEARN ~ UHRSWORK + AGE + LAW + SEX, acs_data)
summary(ls)
##
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + SEX, data = acs_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -282796  -89635  -29132   70136  772441
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    847.0     9332.0   0.091    0.928
## UHRSWORK      1251.9      113.5  11.030   <2e-16 ***
## AGE           3473.9      124.1  27.991   <2e-16 ***
## LAW         -47539.3     2629.5 -18.079   <2e-16 ***
## SEX         -39363.2     2786.2 -14.128   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 126900 on 9995 degrees of freedom
## Multiple R-squared: 0.149,  Adjusted R-squared: 0.1487
## F-statistic: 437.5 on 4 and 9995 DF, p-value: < 2.2e-16
```
What are they?
Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise
How do we make them?
In R, with `as.factor()`: `lm(y ~ as.factor(category_var))`
What do they do?
```r
summary(lm(INCEARN ~ sex, acs_data))$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 143878.05   2358.730 60.99810 0.000000e+00
## sexMale      58918.14   2873.297 20.50541 1.450084e-91
```
We can include all categories and drop the intercept
```r
summary(lm(INCEARN ~ -1 + sex, acs_data))$coefficients
##           Estimate Std. Error  t value Pr(>|t|)
## sexFemale 143878.0   2358.730  60.9981        0
## sexMale   202796.2   1640.801 123.5958        0
```
How does our interpretation change?
We can consider a case where there are several categories:
Brexit Vote: What are regional differences in vote for Brexit?
Five regions: Southern England, Midlands, Northern England, Wales, and Scotland
How do you interpret this?
```r
# Region5 reference category here: Southern England
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                              Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)               44.61020359  7.7829764  5.7317665 2.048832e-08
## Pct_Trn                    0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5The Midlands        8.57606186  1.2532253  6.8431923 3.179826e-11
## Region5Northern England    6.35984691  1.3248565  4.8004045 2.293393e-06
## Region5Wales               2.30867033  2.0441434  1.1294072 2.594499e-01
## Region5Scotland          -11.60364022  1.8478179 -6.2796448 9.422928e-10
```
How do you interpret this?
```r
# Region5 reference category here: The Midlands
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                              Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)               53.18626546  7.7855372   6.8314188 3.420699e-11
## Pct_Trn                    0.08933588  0.1027488   0.8694588 3.851539e-01
## Region5Northern England   -2.21621496  1.5559745  -1.4243260 1.551859e-01
## Region5Wales              -6.26739154  2.2030506  -2.8448696 4.687357e-03
## Region5Scotland          -20.17970209  2.0149071 -10.0152022 4.494699e-21
## Region5Southern England   -8.57606186  1.2532253  -6.8431923 3.179826e-11
```
How do you interpret this?
```r
# Region5 reference category here: Northern England
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                              Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)               50.97005050  7.3635671  6.9219239 1.946734e-11
## Pct_Trn                    0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Wales              -4.05117658  2.1752891 -1.8623624 6.333598e-02
## Region5Scotland          -17.96348713  1.9097366 -9.4062643 5.303722e-19
## Region5Southern England   -6.35984691  1.3248565 -4.8004045 2.293393e-06
## Region5The Midlands        2.21621496  1.5559745  1.4243260 1.551859e-01
```
How do you interpret this?
```r
# Region5 reference category here: Wales
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                              Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)               46.91887392  7.6346802  6.1454930 2.043370e-09
## Pct_Trn                    0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Scotland          -13.91231055  2.4939139 -5.5785047 4.660849e-08
## Region5Southern England   -2.30867033  2.0441434 -1.1294072 2.594499e-01
## Region5The Midlands        6.26739154  2.2030506  2.8448696 4.687357e-03
## Region5Northern England    4.05117658  2.1752891  1.8623624 6.333598e-02
```
How do you interpret this?
```r
# Region5 reference category here: Scotland
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
##                             Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)              33.00656337  7.2248567  4.5684731 6.681290e-06
## Pct_Trn                   0.08933588  0.1027488  0.8694588 3.851539e-01
## Region5Southern England  11.60364022  1.8478179  6.2796448 9.422928e-10
## Region5The Midlands      20.17970209  2.0149071 10.0152022 4.494699e-21
## Region5Northern England  17.96348713  1.9097366  9.4062643 5.303722e-19
## Region5Wales             13.91231055  2.4939139  5.5785047 4.660849e-08
```
How do you interpret this?
```r
# No intercept (-1): every region gets its own coefficient
summary(lm(Pct_Lev ~ -1 + Pct_Trn + Region5, brexit))$coefficients
##                             Estimate Std. Error   t value     Pr(>|t|)
## Pct_Trn                   0.08933588  0.1027488 0.8694588 3.851539e-01
## Region5Scotland          33.00656337  7.2248567 4.5684731 6.681290e-06
## Region5Southern England  44.61020359  7.7829764 5.7317665 2.048832e-08
## Region5The Midlands      53.18626546  7.7855372 6.8314188 3.420699e-11
## Region5Northern England  50.97005050  7.3635671 6.9219239 1.946734e-11
## Region5Wales             46.91887392  7.6346802 6.1454930 2.043370e-09
```
In R, you can shift the reference category using the following:

```r
levels = c('Scotland', 'Southern England', 'The Midlands', 'Northern England', 'Wales')
brexit$Region5 = factor(brexit$Region5, levels = levels)
```

The first category is the one dropped (in this case, Scotland): it becomes the reference.
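A sketch with a small hypothetical factor (not the brexit data): `relevel()` is a shortcut when you only want to change which category comes first:

```r
region = factor(c("Wales", "Scotland", "Wales", "Midlands"))
levels(region)   # alphabetical by default: "Midlands" "Scotland" "Wales"
region = relevel(region, ref = "Scotland")
levels(region)   # "Scotland" "Midlands" "Wales"
```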